Resampling performance improvement and sparse aggregation columns support #3062
ArcticDB Code Review Summary
Delta since last review is two new commits on the rebased branch (
Documentation
Tests
Notes (no action required)
#### Reference Issues/PRs
Optimizations on top of #3091
Used in #3062

#### What does this implement or fix?
Some micro optimizations on binary search methods:
- Don't keep `TypedBlockData` in `ColumnDataIterator`. Instead only keep `block_data_` and `block_size_`
- Don't recalculate block pointer and size when we already know them during gallop

#### Any other comments?
Benchmarks for all search and iteration methods:

| Benchmark | Before (ns) | After (ns) | Delta |
|---|---:|---:|---:|
| iterate_irregular_blocks_1 (one row per block) | 478,496 | 311,163 | −35.0% |
| iterate_with_iterator (100 rows) | 798 | 719 | −9.9% |
| exponential_lb_single_block (in first 100) | 356 | 323 | −9.2% |
| exponential_lb_single_block (full gallop) | 458 | 424 | −7.4% |
| exponential_lb_regular (in first 100) | 364 | 339 | −6.7% |
| exponential_lb_irregular_1000 (in first 100) | 360 | 335 | −6.7% |
| exponential_lb_irregular_1000 (full gallop) | 496 | 476 | −3.9% |
| exponential_lb_regular (full gallop) | 504 | 489 | −2.9% |
| exponential_lb_irregular_1 (in first 100) | 464 | 455 | −2.0% |
| exponential_lb_irregular_1 (full gallop) | 687 | 679 | −1.3% |
| lower_bound_single_block | 411 | 394 | −4.1% |
| lower_bound_irregular_1000 | 444 | 431 | −3.0% |
| lower_bound_irregular_1 | 595 | 579 | −2.8% |
| lower_bound_regular_blocks | 443 | 436 | −1.4% |
| iterate_single_block | 27,305 | 27,247 | −0.2% |
| iterate_regular_blocks | 29,051 | 28,734 | −1.1% |
| iterate_irregular_blocks_1000 | 28,136 | 27,893 | −0.9% |
| iterate_with_scalar_at (100 rows) | 182,183,122 | 182,088,026 | −0.1% |

#### Checklist

<details>
<summary> Checklist for code changes... </summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

</details>

<!--
Thanks for contributing a Pull Request to ArcticDB! Please ensure you have taken a look at:
- ArcticDB's Code of Conduct: https://github.com/man-group/ArcticDB/blob/master/CODE_OF_CONDUCT.md
- ArcticDB's Contribution Licensing: https://github.com/man-group/ArcticDB/blob/master/docs/mkdocs/docs/technical/contributing.md#contribution-licensing
-->

Co-authored-by: Ivo <ivo.dilov@man.com>
It would be nice to have Hypothesis tests covering some basic scenarios against Polars. There is no need to test all supported parameters, as some are quite painful to test.
            size_t{0},
            [](size_t acc, const auto& col) { return acc + col->row_count(); }
    );
    const auto max_output_rows = std::min(bucket_boundaries.size() - 1, total_input_rows);
How can you end up with `bucket_boundaries.size() - 1 > total_input_rows`?
If we have loads of empty buckets, e.g. when using `resample("1h")` on a table which has a 24h frequency like 2026-01-01, 2026-01-02, 2026-01-03.
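To make the empty-bucket case concrete, here is a minimal sketch, not the PR's code. It assumes that empty buckets produce no output row, which is what the reply implies; `count_non_empty_buckets`, `max_output_rows` and `hourly_boundaries` are hypothetical helpers, and timestamps are expressed in whole hours for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts buckets [boundaries[i], boundaries[i + 1]) that
// contain at least one timestamp. Both inputs must be sorted ascending.
std::size_t count_non_empty_buckets(
        const std::vector<std::int64_t>& timestamps,
        const std::vector<std::int64_t>& boundaries) {
    assert(std::is_sorted(timestamps.begin(), timestamps.end()));
    assert(std::is_sorted(boundaries.begin(), boundaries.end()));
    std::size_t non_empty = 0;
    auto row = timestamps.begin();
    for (std::size_t i = 0; i + 1 < boundaries.size(); ++i) {
        // Skip rows that fall before the start of this bucket.
        row = std::lower_bound(row, timestamps.end(), boundaries[i]);
        if (row != timestamps.end() && *row < boundaries[i + 1]) {
            ++non_empty;
        }
    }
    return non_empty;
}

// Upper bound on output rows, mirroring the std::min in the diff above.
std::size_t max_output_rows(std::size_t num_boundaries, std::size_t total_input_rows) {
    const std::size_t num_buckets = num_boundaries == 0 ? 0 : num_boundaries - 1;
    return std::min(num_buckets, total_input_rows);
}

// Hypothetical helper: hourly boundaries 0, 1, ..., num_hours.
std::vector<std::int64_t> hourly_boundaries(std::int64_t num_hours) {
    std::vector<std::int64_t> out;
    for (std::int64_t h = 0; h <= num_hours; ++h) {
        out.push_back(h);
    }
    return out;
}
```

With three daily rows at hours 0, 24 and 48 and 49 hourly buckets, only three buckets are non-empty, so bounding the output by the input row count avoids reserving space for the other 46.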
Previously each of `generate_output_index_column`, `generate_resample_output_column` and `aggregate` had complicated logic to identify which row corresponds to which output column. This is simplified by creating a `ResampleMapping` when building the output index column to store which output row corresponds to which input values. Then `ResampleMapping` is used in the other methods.
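A rough sketch of what such a mapping could look like, based only on the description in this PR (output row to a start and end position, each a column index plus offset). The type names, the half-open `[start, end)` convention and `sum_bucket` are assumptions for illustration, not the actual ArcticDB definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical position of a row: which input column and which offset inside it.
struct RowPosition {
    std::size_t column_index;
    std::size_t column_offset;
};

// Assumption: one output row covers the input rows in the half-open range [start, end).
struct BucketRange {
    RowPosition start;
    RowPosition end;
};

// Assumed shape of the mapping: entry i describes output row i.
struct ResampleMappingSketch {
    std::vector<BucketRange> ranges;
};

// Example consumer: sums the values of one output row by walking the input
// columns it spans, without recomputing any bucket boundaries.
long long sum_bucket(const std::vector<std::vector<long long>>& columns, const BucketRange& range) {
    long long total = 0;
    for (std::size_t col = range.start.column_index;
         col <= range.end.column_index && col < columns.size();
         ++col) {
        const std::size_t begin = (col == range.start.column_index) ? range.start.column_offset : 0;
        const std::size_t end = (col == range.end.column_index) ? range.end.column_offset : columns[col].size();
        assert(begin <= end && end <= columns[col].size());
        for (std::size_t i = begin; i < end; ++i) {
            total += columns[col][i];
        }
    }
    return total;
}
```

The point of the design is that the mapping is computed once while building the index column, and every aggregator then reads its row ranges from it.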
A lot of resampling runtime was spent generating the output index column. This can be sped up significantly in the common case where the number of buckets is much smaller than the number of input rows by using exponential binary search.
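For readers unfamiliar with the technique, exponential (galloping) search over a sorted array can be sketched as follows. This is a generic version over a `std::vector`, not the block-aware implementation from this PR; `exponential_lower_bound` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first element >= target at or after `start`,
// or values.size() if there is none. `values` must be sorted ascending.
std::size_t exponential_lower_bound(
        const std::vector<std::int64_t>& values, std::size_t start, std::int64_t target) {
    const std::size_t n = values.size();
    assert(start <= n);
    if (start == n || values[start] >= target) {
        return start;
    }
    // Gallop: values[lo] < target always holds; double the step until we overshoot.
    std::size_t lo = start;
    std::size_t step = 1;
    while (lo + step < n && values[lo + step] < target) {
        lo += step;
        step *= 2;
    }
    const std::size_t hi = std::min(lo + step, n);
    // The answer lies in (lo, hi]; finish with a binary search on that window.
    const auto first = values.begin() + static_cast<std::ptrdiff_t>(lo + 1);
    const auto last = values.begin() + static_cast<std::ptrdiff_t>(hi);
    return static_cast<std::size_t>(std::lower_bound(first, last, target) - values.begin());
}
```

Calling it with `start` set to the first row of the current bucket and `target` set to the bucket's end boundary skips all rows of that bucket in a number of comparisons logarithmic in the bucket's row count, rather than in the total row count.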
Helps speed up and decrease memory usage for the very rare case where num_buckets >> num_input_rows.
Benchmarking various `rows_per_bucket` values confirmed that exponential search becomes faster than a linear scan at around 32 elements. Below 32 rows per bucket the linear pass is faster; above 32 the exponential search is faster.
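One way such a crossover could be turned into a dispatch rule is sketched below. The enum, the constant name and `choose_search_strategy` are hypothetical; only the threshold of roughly 32 rows per bucket comes from the note above, and treating exactly 32 as linear is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>

enum class SearchStrategy { LinearScan, ExponentialSearch };

// Crossover of roughly 32 rows per bucket, taken from the benchmark note above.
constexpr std::size_t kExponentialSearchThreshold = 32;

// Hypothetical heuristic: estimate the average rows per bucket and pick a strategy.
SearchStrategy choose_search_strategy(std::size_t num_input_rows, std::size_t num_buckets) {
    assert(num_buckets > 0);
    const std::size_t rows_per_bucket = num_input_rows / num_buckets;
    return rows_per_bucket > kExponentialSearchThreshold ? SearchStrategy::ExponentialSearch
                                                         : SearchStrategy::LinearScan;
}
```

For roughly 1M rows this picks exponential search for 1k buckets (about 1000 rows per bucket) and a linear scan for 100k buckets (about 10 rows per bucket).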
Construct output agg column based on rs_index of input sparse columns. Then use sparse iterators to populate the values.
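The idea of keeping the output sparse can be illustrated with a simplified stand-in. This sketch uses a plain presence bitmap and walks every row, whereas the commit describes using the rank-select index and sparse iterators; `SparseColumnSketch` and `sparse_bucket_sums` are hypothetical names, and the bucket layout by row index is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical sparse column: present[row] marks rows that hold a value and
// `values` stores only those values, densely, in row order.
struct SparseColumnSketch {
    std::vector<bool> present;
    std::vector<double> values;
};

// Sums each bucket [bucket_starts[i], bucket_starts[i + 1]) of row indices.
// A bucket with no present rows yields std::nullopt, so the output stays sparse.
std::vector<std::optional<double>> sparse_bucket_sums(
        const SparseColumnSketch& column, const std::vector<std::size_t>& bucket_starts) {
    std::vector<std::optional<double>> out;
    std::size_t dense_pos = 0;  // index of the next unread entry in column.values
    std::size_t row = 0;
    for (std::size_t b = 0; b + 1 < bucket_starts.size(); ++b) {
        assert(bucket_starts[b] <= bucket_starts[b + 1]);
        assert(bucket_starts[b + 1] <= column.present.size());
        // Skip rows before the bucket, keeping dense_pos in sync with the bitmap.
        for (; row < bucket_starts[b]; ++row) {
            if (column.present[row]) ++dense_pos;
        }
        std::optional<double> sum;
        for (; row < bucket_starts[b + 1]; ++row) {
            if (column.present[row]) {
                sum = sum.value_or(0.0) + column.values[dense_pos++];
            }
        }
        out.push_back(sum);
    }
    return out;
}
```

A real implementation would jump between set bits instead of visiting every row, which is where a rank-select index helps.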
Reference Issues/PRs
Monday ref: 11679866800
Depends on PRs #3091 and #3110
Issues
- `generate_output_index_column`, `generate_resampling_output_column`, `SortedAggregator::aggregate`

Changes (split per commit for easier review)
- `generate_output_index_column` to `sorted_aggregation.cpp`.
- `ResampleMapping` in `generate_output_index_column` and use it directly in other methods. `ResampleMapping` just has a mapping from `output_row` to `(start_column_index, start_column_offset), (end_column_index, end_column_offset)`.
- `generate_output_index_column` to skip past all rows in a single bucket at once.
- `O(num_input_rows + num_buckets)` to `O(num_buckets × log(rows_per_bucket))`.
- `O(num_input_rows + num_buckets)` even when `num_buckets ≥ num_input_rows`.
- `min(num_buckets, num_input_rows)` instead of `num_buckets`.
- `ResampleMapping` from commit 2.

Resample benchmark timings
`BM_resample/<rows_per_seg>/<num_segs>/<num_buckets>/<num_cols>`. Total rows ~1M.
Source: `cpp/arcticdb/processing/test/benchmark_resample.cpp`. Times in ms, `--benchmark_min_time=2s`.
Configurations: 100k × 10, 1k buckets; 100k × 10, 10k buckets; 100k × 10, 100k buckets; 2k × 500, 100 buckets; 100k × 10, 10M buckets.
1 aggregation column
100 aggregation columns
Deltas vs baseline (row 0).
Notes on benchmark results
`ARCTICDB_LIKELY` and `ARCTICDB_UNLIKELY`).